Skip to content

Fix CallawaySantAnna propensity score estimation (IRLS)#202

Merged
igerber merged 6 commits intomainfrom
csa-review
Mar 16, 2026
Merged

Fix CallawaySantAnna propensity score estimation (IRLS)#202
igerber merged 6 commits intomainfrom
csa-review

Conversation

@igerber
Copy link
Owner

@igerber igerber commented Mar 15, 2026

Summary

  • Replace BFGS-based logistic regression with IRLS (Fisher scoring) for propensity score estimation, matching R's glm(family=binomial) algorithm
  • Add solve_logit() and _check_propensity_diagnostics() to the unified linalg.py backend
  • Add configurable pscore_trim parameter to CallawaySantAnna
  • Fix silent fallback in the doubly-robust path (now emits warning when propensity estimation fails)
  • Remove duplicated _logistic_regression() from staggered.py and triple_diff.py

Methodology references (required if estimator / math changes)

  • Method name(s): Callaway & Sant'Anna (2021) — propensity score estimation component
  • Paper / source link(s): https://doi.org/10.1016/j.jeconom.2020.12.001
  • Any intentional deviations from the source (and why): Uses IRLS (Fisher scoring) for propensity score estimation, consistent with R's did::att_gt() which uses glm(family=binomial) internally. The previous BFGS implementation converged to different coefficients under near-separation, producing inflated ATT estimates.

Validation

  • Tests added/updated: test_linalg.py (8 new), test_staggered.py (7 new), test_methodology_triple_diff.py (mock fix)
  • External validation: Tested against Dias and Fontes (2024) Brazil dataset with poptotaltrend covariate. IRLS produces |ATT|=0.86 (within R reference range 0.45-1.15), vs old BFGS inflated estimates.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Switch CallawaySantAnna and TripleDifference from BFGS-based logistic
regression to IRLS (Fisher scoring), matching R's glm(family=binomial).
Under near-separation (e.g., large-scale covariates like population
trends), BFGS converges to different coefficients than IRLS, producing
inflated ATT estimates. IRLS matches the reference implementations.

Key changes:
- Add solve_logit() and _check_propensity_diagnostics() to linalg.py
- Add configurable pscore_trim parameter to CallawaySantAnna
- Fix silent DR fallback (now emits warning)
- Remove duplicated _logistic_regression() from staggered.py and
  triple_diff.py
- Document IRLS algorithm and diagnostics in REGISTRY.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link

Overall Assessment

⚠️ Needs changes

Executive Summary

  • Affected methodology: Callaway-Sant'Anna’s propensity-score component is the primary change; TripleDifference is only affected through the shared solve_logit() backend refactor.
  • The IRLS switch itself is documented in the Methodology Registry, so I did not treat BFGS→IRLS as a methodology defect.
  • P1: the new public pscore_trim parameter is accepted without any range validation and then fed directly into clipping/weight formulas in the CS IPW/DR paths.
  • P1: pscore_trim is not propagated into CallawaySantAnnaResults, so two runs with different trimming settings produce indistinguishable result objects.
  • P2: the shared solve_logit() hardcodes warn-and-drop behavior for rank-deficient propensity models and cannot honor estimator-level rank_deficient_action.

Methodology

  • P3 Informational: the IRLS/fallback/trimming behavior is explicitly documented in the registry, so the algorithm switch itself is not an undocumented methodology deviation. Impact: none; this is a documented implementation choice. Concrete fix: none required. References: docs/methodology/REGISTRY.md:L396-L405.
  • P1 The new pscore_trim API has no valid-range guard in CallawaySantAnna.__init__, even though the estimator now defines trimming as clipping to [pscore_trim, 1-pscore_trim]. Impact: invalid public inputs such as negative values or pscore_trim >= 0.5 flow straight into the clipping and IPW/DR weight calculations, so the estimator can compute meaningless weights instead of failing fast. Concrete fix: validate 0 <= pscore_trim < 0.5 in the constructor and add explicit negative / 0.5 / >0.5 tests. References: diff_diff/staggered.py:L241-L315, diff_diff/staggered.py:L1498-L1505, diff_diff/staggered.py:L1665-L1669.

Code Quality

  • P2 solve_logit() always warns and drops rank-deficient propensity-score columns, with no way for callers to propagate rank_deficient_action. Impact: the refactored shared PS backend can ignore user-selected rank_deficient_action="silent" or "error" on the modified IPW/DR paths in both CS and DDD, so the estimator contract is inconsistent across outcome-regression vs propensity-regression code paths. Concrete fix: add a rank_deficient_action argument to solve_logit() and thread it through the CS/DDD call sites, then add PS-path tests for "silent" and "error". References: diff_diff/linalg.py:L922-L932, diff_diff/staggered.py:L1477-L1484, diff_diff/triple_diff.py:L765-L779.

Performance

  • No findings.

Maintainability

  • P1 The new pscore_trim parameter is only partially wired downstream: fit() constructs CallawaySantAnnaResults without it, and the results dataclass has no field to record it. Impact: a methodology-affecting option changes the estimate, but downstream consumers cannot tell which trim setting produced a given result object, summary, or serialized table. Concrete fix: add pscore_trim to CallawaySantAnnaResults, populate it in fit(), and expose it anywhere other estimator settings are surfaced. References: diff_diff/staggered.py:L1339-L1358, diff_diff/staggered_results.py:L96-L114.

Tech Debt

  • No existing TODO.md entry mitigates the P1 items above. Impact: per the repo’s review policy, these remain blockers rather than deferrable follow-ups. Concrete fix: address them in this PR rather than deferring. References: TODO.md:L38-L66.

Security

  • No findings.

Documentation/Tests

  • P2 The PR adds solid unit coverage for solve_logit() and default clipping behavior, but the new public pscore_trim option only gets accessor coverage and no non-default estimator-level coverage through CS aggregation/bootstrap paths. Impact: regressions in non-default trim behavior would reach users without any test signal, especially for aggregate="all" and bootstrap inference where trimmed weights feed downstream influence functions. Concrete fix: add IPW and DR tests with pscore_trim != 0.01 covering overall, event-study, group, and bootstrap outputs, and assert the trim setting is preserved in results. References: tests/test_staggered.py:L600-L650, tests/test_staggered.py:L3167-L3186, tests/test_staggered.py:L3189-L3337.

Path to Approval

  1. Validate pscore_trim in CallawaySantAnna.__init__ and reject invalid bounds before any weighting code runs.
  2. Thread pscore_trim into CallawaySantAnnaResults so result objects remain self-describing.
  3. Add non-default pscore_trim estimator tests for CS IPW and DR with aggregate="all" and n_bootstrap > 0.

Verification note: I could not run the test suite in this environment because the Python dependencies are not installed (numpy is missing).

…it rank_deficient_action

- Validate pscore_trim in [0, 0.5) in CallawaySantAnna.__init__ (P1)
- Add pscore_trim field to CallawaySantAnnaResults and pass from fit() (P1)
- Add rank_deficient_action parameter to solve_logit() with warn/error/silent (P2)
- Thread rank_deficient_action through staggered.py and triple_diff.py call sites (P2)
- Add tests for validation, results propagation, non-default trim, and rank_deficient_action

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 15, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: 46077d45d56805529b7be98d84d2668c560823f6


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Affected methods: Callaway-Sant'Anna IPW/DR propensity-score estimation and TripleDifference pairwise IPW/DR propensity-score estimation.
  • The IRLS switch itself is no longer a methodology defect. For Callaway-Sant'Anna, the Methodology Registry now explicitly documents IRLS, trimming, and fallback behavior.
  • The prior pscore_trim findings are resolved: range validation is present, CallawaySantAnnaResults now records the trim setting, and non-default trim tests were added.
  • P1 [Previous finding partially unresolved]: rank_deficient_action="error" is still converted into unconditional-propensity fallback in the PS-based paths, so the estimators do not honor the requested fail-fast behavior.
  • Test coverage still misses that estimator-level error-mode path, which is why the regression remains.

Methodology

  • P3 Informational. The optimizer change from BFGS to IRLS is not a defect in this re-review. For Callaway-Sant'Anna, the registry now documents IRLS, trimming, and fallback explicitly, and TripleDifference’s registry requirements only constrain trimming/overlap/fallback behavior rather than the specific optimizer. Impact: none. Concrete fix: none. References: docs/methodology/REGISTRY.md:L397-L405, docs/methodology/REGISTRY.md:L1186-L1202

Code Quality

  • P1 [Previous finding partially unresolved]. solve_logit() correctly raises on rank-deficient logistic designs when rank_deficient_action="error", but the changed PS wrappers in CallawaySantAnna IPW/DR and TripleDifference IPW/DR immediately catch that ValueError and replace the propensity model with unconditional scores. Impact: with collinear PS covariates, a user who explicitly asked for fail-fast behavior can instead get a different estimator and different SEs, only gated by a warning/fallback. Concrete fix: in these PS paths, re-raise ValueError when self.rank_deficient_action == "error" (or do not catch ValueError from solve_logit() in that mode), and keep fallback only for true numerical failures / non-error modes. References: diff_diff/linalg.py:L936-L945, diff_diff/staggered.py:L1481-L1499, diff_diff/staggered.py:L1655-L1672, diff_diff/triple_diff.py:L768-L783

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • P3 Informational. I did not find a TODO.md entry that tracks the remaining PS error-mode issue, so it is not mitigated by the project’s deferred-work policy. Impact: none beyond assessment accounting. Concrete fix: address it in this PR rather than defer. Reference: TODO.md:L38-L49

Security

No findings.

Documentation/Tests

Path to Approval

  1. In diff_diff/staggered.py:L1481-L1499 and diff_diff/staggered.py:L1655-L1672, stop converting solve_logit() rank-deficiency ValueError into unconditional fallback when rank_deficient_action="error".
  2. Make the same change in diff_diff/triple_diff.py:L768-L783 so TripleDifference IPW/DR also honors rank_deficient_action="error".
  3. Add regression tests for those PS paths with duplicated covariates, asserting ValueError under "error" and warning-backed fallback only under the default non-error modes.

Verification note: I could not execute the test suite in this environment because numpy is not installed.

Re-raise ValueError in IPW/DR except blocks when rank_deficient_action
is "error" instead of silently falling back to unconditional estimation.
Applies to CallawaySantAnna and TripleDifference PS paths.

Add estimator-level regression tests asserting ValueError propagation
with collinear covariates under error mode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 15, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: 96c71de55897a2468331c8a76af5774e58c52828


Overall Assessment

⚠️ Needs changes

The previous blocker around rank_deficient_action="error" silently falling back is resolved. The remaining blocker is a new edge-case in the Callaway-Sant’Anna weighting path: this PR newly accepts pscore_trim=0.0, but the shared IRLS helper can still emit exact boundary scores, and the CS IPW/DR code divides by 1 - pscore without an open-interval guard.

Executive Summary

Methodology

Code Quality

No findings. The prior rank_deficient_action="error" regression is fixed in the changed PS wrappers and covered by new estimator-level tests. See diff_diff/staggered.py:L1480-L1493, diff_diff/staggered.py:L1657-L1668, diff_diff/triple_diff.py:L768-L776, tests/test_staggered.py:L1278-L1316, and tests/test_methodology_triple_diff.py:L1560-L1579.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

  • P3. Documentation is not fully synced with the new PS behavior: the registry fallback note still reads as though any IRLS ValueError falls back to unconditional propensity scores, but code now intentionally re-raises in rank_deficient_action="error" mode; separately, the CS estimator/results docstrings do not document the new pscore_trim API surface. Impact: minor docs/API drift, no runtime defect. Concrete fix: qualify the registry fallback note as applying only outside "error" mode, and add pscore_trim to the CallawaySantAnna and CallawaySantAnnaResults docstrings. References: docs/methodology/REGISTRY.md:L396-L405, diff_diff/staggered.py:L120-L167, diff_diff/staggered.py:L241-L255, diff_diff/staggered_results.py:L70-L117.

Path to Approval

  1. Decide whether pscore_trim=0.0 is supported for Callaway-Sant’Anna. If not, tighten validation in diff_diff/staggered.py:L255-L269 to require 0 < pscore_trim < 0.5.
  2. If 0.0 must remain accepted, clamp propensity scores to an open interval before the CS IPW/DR weight calculations in diff_diff/staggered.py:L1507-L1514 and diff_diff/staggered.py:L1678-L1684.
  3. Add regression tests that hit the boundary case for both CS ipw and dr: near-separation data (or a mocked PS vector containing 1.0) with pscore_trim=0.0, asserting either a ValueError if zero-trim is disallowed or finite, guarded outputs if it remains supported.

Verification note: this review is by diff inspection only; I could not execute the test suite in this sandbox because numpy is unavailable.

- Change pscore_trim validation from [0, 0.5) to (0, 0.5) since zero
  trimming allows exact 1.0 propensity scores through to the weight
  formula pscore/(1-pscore), producing inf/NaN
- Update REGISTRY.md fallback note to document error-mode re-raise
- Add pscore_trim to CallawaySantAnna and CallawaySantAnnaResults docstrings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 15, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: b151613d625933629c8592fd05ad9c4b9705edfb


Overall Assessment

⚠️ Needs changes

The previous zero-trim blocker is fixed at construction time, but one approval-blocking hole remains: the new public set_params() path can still inject unsupported pscore_trim values into the Callaway-Sant’Anna IPW/DR weight formulas.

Executive Summary

  • The prior pscore_trim=0.0 constructor issue is partially resolved: CallawaySantAnna.__init__() now enforces 0 < pscore_trim < 0.5, and the CS IPW/DR paths clip using self.pscore_trim.
  • Remaining P1: CallawaySantAnna.set_params() blindly assigns pscore_trim, so the supported sklearn-style API can still bypass that validation and reopen the zero-trim / invalid-trim weight path.
  • I did not find a new SE/variance or NaN-propagation regression in the touched estimator code; the changed CS/DDD inference paths still use the existing safe inference / IF-based machinery.
  • The CS IRLS switch itself is documented in the Methodology Registry and is not a defect.
  • Minor P3: the TripleDifference registry text still says PS-estimation failure falls back unconditionally, but the new code/tests now re-raise in rank_deficient_action="error" mode.

Methodology

  • Severity: P1 [Newly identified]. Impact: pscore_trim is validated only in __init__(), but this PR also exposes it through get_params() and explicitly blesses set_params(pscore_trim=...) in tests. That means CallawaySantAnna().set_params(pscore_trim=0.0) can still bypass the new open-interval guard, after which fit() feeds 0.0 into the CS IPW/DR clipping and p/(1-p) weight formulas. Under near-separation, that reopens the exact undefined-boundary-weight path this PR was supposed to close. Concrete fix: centralize parameter validation so it also runs after set_params() (or at fit() entry), and add regression coverage for invalid set_params(pscore_trim=0.0) / negative / >=0.5 on both CS ipw and dr. References: diff_diff/staggered.py:L248-L273, diff_diff/staggered.py:L1746-L1753, diff_diff/staggered.py:L1507-L1516, diff_diff/staggered.py:L1682-L1688, tests/test_staggered.py:L3217-L3246.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Path to Approval

  1. Revalidate pscore_trim after set_params() or at the start of CallawaySantAnna.fit() so unsupported values cannot reach the IPW/DR clipping and weight formulas.
  2. Add regression tests that call CallawaySantAnna().set_params(pscore_trim=0.0) and at least one other invalid value, then fit with both estimation_method="ipw" and "dr", asserting ValueError before any ATT/SE computation.

Review based on diff/source inspection only; I did not execute the test suite in this sandbox.

set_params() can inject invalid pscore_trim values since it bypasses
__init__ validation. Add check at fit() entry to catch 0.0, negative,
and >=0.5 values before they reach IPW/DR weight formulas.

Also update TripleDifference registry fallback note for error mode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 15, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: e42d5b12cdcac16cbd38caab4a077fbde7a4adad


Overall Assessment

✅ Looks good

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. I did not see a relevant pre-existing tracked limitation in TODO.md:L42-L63 that changes the assessment here.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the main changed methodology path is covariate-adjusted propensity estimation, but the direct Callaway-Sant’Anna R helpers still hard-code xformla = ~ 1, so they never exercise the new IRLS-backed covariate path. The new PR tests in tests/test_staggered.py:L3431-L3487 and tests/test_staggered.py:L3515-L3570 only check wide plausibility bounds, which would miss smaller but still material drifts. The current R helpers remain unconditional in tests/test_methodology_callaway.py:L376-L389 and tests/test_methodology_callaway.py:L1657-L1669. Concrete fix: extend the R helper to accept covariates and add at least one covariate-adjusted dr and ipw benchmark against did::att_gt(), then tighten the new CS assertions to compare against that reference instead of broad range checks.

Review based on diff/source inspection only; I did not execute the test suite in this sandbox.

Track P3 from PR #202 review: CS R helpers hard-code xformla = ~ 1,
so the IRLS covariate path lacks tight R-backed regression tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber igerber merged commit 1d3f5a5 into main Mar 16, 2026
10 checks passed
@igerber igerber deleted the csa-review branch March 16, 2026 00:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant